Red Wine Quality Exploration by Jiemin Wang

Background

In this project, we explore a dataset of 1,599 red wines, each described by 11 chemical properties. The dataset also contains a quality score for each wine, rated by at least 3 wine experts. The purpose of this project is to practice EDA (Exploratory Data Analysis) by analyzing the dataset to find out which chemical properties influence the quality of red wines.

Attributes and Descriptions

1 - fixed acidity: most acids involved with wine are fixed or nonvolatile (do not evaporate readily)

2 - volatile acidity: the amount of acetic acid in wine, which at too high of levels can lead to an unpleasant, vinegar taste

3 - citric acid: found in small quantities, citric acid can add ‘freshness’ and flavor to wines

4 - residual sugar: the amount of sugar remaining after fermentation stops, it’s rare to find wines with less than 1 gram/liter and wines with greater than 45 grams/liter are considered sweet

5 - chlorides: the amount of salt in the wine

6 - free sulfur dioxide: the free form of SO2 exists in equilibrium between molecular SO2 (as a dissolved gas) and bisulfite ion; it prevents microbial growth and the oxidation of wine

7 - total sulfur dioxide: amount of free and bound forms of SO2; in low concentrations, SO2 is mostly undetectable in wine, but at free SO2 concentrations over 50 ppm, SO2 becomes evident in the nose and taste of wine

8 - density: the density of wine is close to that of water depending on the percent alcohol and sugar content

9 - pH: describes how acidic or basic a wine is on a scale from 0 (very acidic) to 14 (very basic); most wines are between 3-4 on the pH scale

10 - sulphates: a wine additive which can contribute to sulfur dioxide gas (SO2) levels, which acts as an antimicrobial and antioxidant

11 - alcohol: the percent alcohol content of the wine

Output variable (based on sensory data):

12 - quality (score between 0 and 10)

Univariate Plots Section

We first get some overview of the dataset.

## [1] 1599   13
##  [1] "X"                    "fixed.acidity"        "volatile.acidity"    
##  [4] "citric.acid"          "residual.sugar"       "chlorides"           
##  [7] "free.sulfur.dioxide"  "total.sulfur.dioxide" "density"             
## [10] "pH"                   "sulphates"            "alcohol"             
## [13] "quality"
## 'data.frame':    1599 obs. of  13 variables:
##  $ X                   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ fixed.acidity       : num  7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
##  $ volatile.acidity    : num  0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
##  $ citric.acid         : num  0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
##  $ residual.sugar      : num  1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
##  $ chlorides           : num  0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
##  $ free.sulfur.dioxide : num  11 25 15 17 11 13 15 15 9 17 ...
##  $ total.sulfur.dioxide: num  34 67 54 60 34 40 59 21 18 102 ...
##  $ density             : num  0.998 0.997 0.997 0.998 0.998 ...
##  $ pH                  : num  3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
##  $ sulphates           : num  0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
##  $ alcohol             : num  9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
##  $ quality             : int  5 5 5 6 5 5 5 7 7 5 ...

X is the row number and is not relevant to the exploration, so we removed this column.
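The report's console output appears to come from R, but the code itself is not shown. As a rough illustration only, the following Python sketch drops the row-number column while loading. The function name `load_without_index` and the three-row inline sample are assumptions made for this example; the values are copied from the `str()` output above.

```python
import csv
import io

# Hypothetical inline sample: a few columns and the first three rows
# taken from the str() output above.
SAMPLE_CSV = """X,fixed.acidity,volatile.acidity,alcohol,quality
1,7.4,0.7,9.4,5
2,7.8,0.88,9.8,5
3,7.8,0.76,9.8,5
"""

def load_without_index(handle, index_column="X"):
    """Read CSV rows as dicts of floats, discarding the row-number column."""
    rows = []
    for record in csv.DictReader(handle):
        record.pop(index_column, None)  # drop the identifier if present
        rows.append({name: float(value) for name, value in record.items()})
    return rows

wines = load_without_index(io.StringIO(SAMPLE_CSV))
print(len(wines), sorted(wines[0]))
```

With the real file, an open file handle would take the place of `io.StringIO(SAMPLE_CSV)`.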

##  fixed.acidity   volatile.acidity  citric.acid    residual.sugar  
##  Min.   : 4.60   Min.   :0.1200   Min.   :0.000   Min.   : 0.900  
##  1st Qu.: 7.10   1st Qu.:0.3900   1st Qu.:0.090   1st Qu.: 1.900  
##  Median : 7.90   Median :0.5200   Median :0.260   Median : 2.200  
##  Mean   : 8.32   Mean   :0.5278   Mean   :0.271   Mean   : 2.539  
##  3rd Qu.: 9.20   3rd Qu.:0.6400   3rd Qu.:0.420   3rd Qu.: 2.600  
##  Max.   :15.90   Max.   :1.5800   Max.   :1.000   Max.   :15.500  
##    chlorides       free.sulfur.dioxide total.sulfur.dioxide
##  Min.   :0.01200   Min.   : 1.00       Min.   :  6.00      
##  1st Qu.:0.07000   1st Qu.: 7.00       1st Qu.: 22.00      
##  Median :0.07900   Median :14.00       Median : 38.00      
##  Mean   :0.08747   Mean   :15.87       Mean   : 46.47      
##  3rd Qu.:0.09000   3rd Qu.:21.00       3rd Qu.: 62.00      
##  Max.   :0.61100   Max.   :72.00       Max.   :289.00      
##     density             pH          sulphates         alcohol     
##  Min.   :0.9901   Min.   :2.740   Min.   :0.3300   Min.   : 8.40  
##  1st Qu.:0.9956   1st Qu.:3.210   1st Qu.:0.5500   1st Qu.: 9.50  
##  Median :0.9968   Median :3.310   Median :0.6200   Median :10.20  
##  Mean   :0.9967   Mean   :3.311   Mean   :0.6581   Mean   :10.42  
##  3rd Qu.:0.9978   3rd Qu.:3.400   3rd Qu.:0.7300   3rd Qu.:11.10  
##  Max.   :1.0037   Max.   :4.010   Max.   :2.0000   Max.   :14.90  
##     quality     
##  Min.   :3.000  
##  1st Qu.:5.000  
##  Median :6.000  
##  Mean   :5.636  
##  3rd Qu.:6.000  
##  Max.   :8.000

Summary of the dataset:

  • The dataset contains 1599 observations and 13 variables with 11 properties of wines, 1 output variable (quality) and 1 unique identifier (X).
  • The 11 properties of wines are all numerical values.
  • quality is a discrete variable which ranges from 0 to 10. However, our dataset only contains quality values from 3 to 8.
  • fixed.acidity, volatile.acidity and citric.acid are different types of acids in wines.
  • free.sulfur.dioxide is a component of total.sulfur.dioxide.

Then, we take a look at distributions of single variables.

quality

## 
##   3   4   5   6   7   8 
##  10  53 681 638 199  18

We transformed quality to an ordered factor.

##  Ord.factor w/ 6 levels "3"<"4"<"5"<"6"<..: 3 3 3 4 3 3 3 5 5 3 ...

As we can see from the results above, although quality can range from 0 to 10, the red wines in our dataset only have scores from 3 to 8. The scores are roughly bell-shaped: about 82% of the wines (1,319 of 1,599) score 5 or 6, and the remaining wines score 3, 4, 7 or 8.
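To make the concentration of scores concrete, here is a small Python sketch that works directly from the counts in the table above. The helper name `share` is an assumption introduced for illustration.

```python
# Quality counts copied from the frequency table above.
quality_counts = {3: 10, 4: 53, 5: 681, 6: 638, 7: 199, 8: 18}

def share(counts, levels):
    """Fraction of all wines whose quality score is in `levels`."""
    total = sum(counts.values())
    return sum(counts[level] for level in levels) / total

total_wines = sum(quality_counts.values())
middle_share = share(quality_counts, [5, 6])
print(total_wines, round(middle_share, 3))
```

The counts sum to the 1,599 observations reported earlier, and scores 5 and 6 together account for roughly 82% of them.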

fixed acidity

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    4.60    7.10    7.90    8.32    9.20   15.90

The above diagram seems to be long-tailed, so we plot it on a base 10 logarithmic scale.

After transforming the data to the log10 scale, the distribution appears close to normal.
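One way to check numerically whether a log10 transformation reduces a long right tail is to compare skewness before and after. The Python sketch below is illustrative only: the helper names are assumptions, and the ten sample values are the first ten fixed.acidity entries from the `str()` output, so they will not reproduce the full-dataset picture.

```python
import math

def skewness(values):
    """Population skewness: third central moment over variance ** 1.5."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

def log10_transform(values):
    """Apply log10 to each value; all inputs must be strictly positive."""
    return [math.log10(v) for v in values]

# First ten fixed.acidity values from the str() output above.
fixed_acidity = [7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, 7.5]
print("raw skewness:  ", skewness(fixed_acidity))
print("log10 skewness:", skewness(log10_transform(fixed_acidity)))
```

A skewness closer to zero after the transformation suggests a more symmetric shape, which is what the log-scale histogram shows visually.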

volatile acidity

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1200  0.3900  0.5200  0.5278  0.6400  1.5800

The above diagram seems to be long-tailed with some outliers.

The distribution appears close to normal after plotting on a base 10 logarithmic scale.

citric acid

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.090   0.260   0.271   0.420   1.000

The above diagram shows that there seem to be a lot of zero values of citric.acid in our dataset.

## [1] 132

To better understand the different acids involved in wine making, I consulted an online reference on acids in wine. Fixed acid, volatile acid and citric acid are different types of acids in wine. According to their descriptions, fixed acid refers to most acids involved with wine and is also called nonvolatile acid; the volatile acid level cannot be too high in wine, otherwise it can lead to an unpleasant, vinegar taste; citric acid is usually found in small quantities in wine and can add “freshness” and flavor. It makes sense that the amounts of both volatile and citric acids in wines are much smaller than that of fixed acid. It is possible that for some wines, the citric acid level is too small to be detected or the data has been rounded to zero.

residual sugar

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.900   1.900   2.200   2.539   2.600  15.500

The above diagram shows that residual.sugar has some extreme outliers. We need to take these outliers into account in further analysis.

Excluding the outliers, the residual.sugar data is closer to a normal distribution.
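The report does not state how the outliers were excluded. One common choice, assumed here purely for illustration, is to drop values above an upper percentile. The Python sketch below uses a 95th-percentile cutoff; the helper names and the cutoff are assumptions, and the sample is the first ten residual.sugar values from the `str()` output.

```python
def percentile(values, q):
    """Linear-interpolation percentile for q in [0, 1]."""
    ordered = sorted(values)
    position = q * (len(ordered) - 1)
    lower = int(position)
    upper = min(lower + 1, len(ordered) - 1)
    fraction = position - lower
    return ordered[lower] + fraction * (ordered[upper] - ordered[lower])

def trim_upper(values, q=0.95):
    """Keep only values at or below the q-th percentile (assumed cutoff)."""
    cutoff = percentile(values, q)
    return [v for v in values if v <= cutoff]

# First ten residual.sugar values from the str() output above.
residual_sugar = [1.9, 2.6, 2.3, 1.9, 1.9, 1.8, 1.6, 1.2, 2.0, 6.1]
trimmed = trim_upper(residual_sugar)
print(len(residual_sugar), len(trimmed), max(trimmed))
```

In this small sample only the value 6.1 lies above the cutoff. Any such rule changes which wines are analyzed, so the chosen cutoff should be reported alongside the plot.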

chlorides

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.01200 0.07000 0.07900 0.08747 0.09000 0.61100

The chlorides data has some extreme outliers. We need to take them into account in further analysis.

Excluding the outliers, the chlorides data is closer to a normal distribution.

free sulfur dioxide

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00    7.00   14.00   15.87   21.00   72.00

The free.sulfur.dioxide data is skewed to the right.

total sulfur dioxide

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    6.00   22.00   38.00   46.47   62.00  289.00

The total.sulfur.dioxide data is skewed to the right and has some extreme outliers. According to the description of attributes, total.sulfur.dioxide consists of the free and bound forms of SO2, so free.sulfur.dioxide is included in it. From the above two diagrams, free.sulfur.dioxide and total.sulfur.dioxide have similar distributions. We are also interested in the bound form, so an additional variable, bound.sulfur.dioxide, will be created to help the investigation.

density

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9901  0.9956  0.9968  0.9967  0.9978  1.0040

The density is approximately normally distributed, with few outliers.

pH

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.740   3.210   3.310   3.311   3.400   4.010

The pH is approximately normally distributed, with few outliers.

sulphates

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.3300  0.5500  0.6200  0.6581  0.7300  2.0000

The sulphates data is long-tailed and has some outliers.

Excluding outliers, the sulphates data is closer to a normal distribution.

alcohol

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.40    9.50   10.20   10.42   11.10   14.90

The alcohol level ranges from 8.4% to 14.9%. The distribution peaks at around 9.5%, and the average alcohol level is 10.42%. From the diagram above, the alcohol data is skewed to the right.

Univariate Analysis

What is the structure of your dataset?

The structure of the dataset has been provided in the “summary of the dataset”.

What is/are the main feature(s) of interest in your dataset?

We are interested in investigating which features influence the quality of wines. By observing the distributions of single variables alone, we cannot tell which feature(s) determine the quality. Further exploration is needed.

What other features in the dataset do you think will help support your investigation into your feature(s) of interest?

Further exploration is needed.

Did you create any new variables from existing variables in the dataset?

  1. We combined fixed.acidity, volatile.acidity and citric.acid into a new variable, acid, to investigate the overall influence of acids on wine quality.
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   5.270   7.827   8.720   9.118  10.070  17.050

The acid data appears to be approximately normally distributed.

  2. bound.sulfur.dioxide is created to investigate the influence of the sulfur dioxide other than free sulfur dioxide.
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    3.00   12.00   21.00   30.59   39.00  251.50

Just like total.sulfur.dioxide and free.sulfur.dioxide, bound.sulfur.dioxide is skewed to the right with some extreme outliers.
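The two derived variables can be sketched as follows in Python. This assumes, as the text describes, that acid is the sum of the three acid columns and that bound.sulfur.dioxide is total minus free sulfur dioxide. The function name `add_derived` is hypothetical, and the sample row is the first observation from the `str()` output.

```python
def add_derived(row):
    """Return a copy of `row` with the acid and bound.sulfur.dioxide columns added."""
    out = dict(row)
    # acid: sum of the three acid measurements
    out["acid"] = (row["fixed.acidity"]
                   + row["volatile.acidity"]
                   + row["citric.acid"])
    # bound SO2: the part of total SO2 that is not free
    out["bound.sulfur.dioxide"] = (row["total.sulfur.dioxide"]
                                   - row["free.sulfur.dioxide"])
    return out

# First observation from the str() output above.
first_wine = {
    "fixed.acidity": 7.4,
    "volatile.acidity": 0.7,
    "citric.acid": 0.0,
    "free.sulfur.dioxide": 11.0,
    "total.sulfur.dioxide": 34.0,
}
derived = add_derived(first_wine)
print(derived["acid"], derived["bound.sulfur.dioxide"])
```

As a consistency check on the summaries above, the means add up as expected: 8.32 + 0.5278 + 0.271 is about 9.12 for acid, and 46.47 - 15.87 is about 30.6 for bound.sulfur.dioxide.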

Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

We observed that some variables have long-tailed distributions with outliers. For some of them, we plotted the diagram on a base 10 logarithmic scale to get a distribution closer to normal. Detailed information can be found in the above section. We chose not to tidy or adjust the form of the data here since we would like to explore relations between variables in the next sections using the original data. We will transform the data in the next sections when needed.

Bivariate Plots Section

To understand the relations between different variables, we compute and plot correlation between each pair of them.

To get a more intuitive view, we plot the correlations as follows:

Based on the above diagrams, we examined the relations between all variables in the dataset and found relatively strong correlations with wine quality:

  1. alcohol (0.48)
  2. volatile.acidity (-0.39)
  3. sulphates (0.25)
  4. citric.acid (0.23)
  5. bound.sulfur.dioxide (-0.21)

Note: Rank based on significance level and strength of correlation.
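The coefficients above are assumed here to be Pearson correlations, with quality treated as a number; the report does not state the method explicitly. A minimal Python sketch of that computation follows. The sample uses only the first ten alcohol and quality values from the `str()` output, so it will not reproduce the full-dataset value of 0.48.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# First ten rows from the str() output above (illustrative subset only).
alcohol = [9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10.0, 9.5, 10.5]
quality = [5, 5, 5, 6, 5, 5, 5, 7, 7, 5]
print(round(pearson(alcohol, quality), 2))
```

Because quality is an ordinal score, a rank-based coefficient such as Spearman's would be a reasonable alternative; that is a suggestion, not something the original analysis reports.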

Then, we further investigate the correlations between quality and some of the transformed variables. Specifically, we compute the correlation values using the base 10 logarithmic scale of some variables discussed in the Univariate Plots Section.

##              alcohol_log10     volatile.acidity_log10 
##                 0.47698109                -0.39124918 
##            sulphates_log10 bound.sulfur.dioxide_log10 
##                 0.30864193                -0.20830074 
##            chlorides_log10 total.sulfur.dioxide_log10 
##                -0.17613996                -0.17014272 
##        fixed.acidity_log10  free.sulfur.dioxide_log10 
##                 0.11423756                -0.05008749 
##       residual.sugar_log10 
##                 0.02353331

We can see that on the base 10 logarithmic scale, some variables (notably sulphates, 0.31 versus 0.25) have stronger relations with wine quality. Keeping only variables whose absolute correlation is at least 0.3, we get the following variables influencing wine quality:

  1. alcohol (or alcohol_log10) (0.48)
  2. volatile.acidity (or volatile.acidity_log10) (-0.39)
  3. sulphates_log10 (0.31)

We also use boxplots to show the relations between quality and the 9 variables most highly correlated with it, including outliers.

Density estimates for the 4 variables most highly correlated with quality.

We would also like to know the correlations between other variables. It is possible that some variables are highly correlated, and that highly correlated variables are either both “good” for wine or both “bad” for wine.

Correlation between other variables

##                     row               column        cor p
## 67        fixed.acidity                 acid  0.9963844 0
## 85 total.sulfur.dioxide bound.sulfur.dioxide  0.9576864 0
## 69          citric.acid                 acid  0.6904382 0
## 75                   pH                 acid -0.6834838 0
## 29        fixed.acidity                   pH -0.6829782 0
## 74              density                 acid  0.6755958 0
## 2         fixed.acidity          citric.acid  0.6717035 0
## 22        fixed.acidity              density  0.6680470 0
## 21  free.sulfur.dioxide total.sulfur.dioxide  0.6676664 0
## 3      volatile.acidity          citric.acid -0.5524957 0
## 31          citric.acid                   pH -0.5419042 0
## 53              density              alcohol -0.4961796 0
## 84  free.sulfur.dioxide bound.sulfur.dioxide  0.4251489 0
## 41            chlorides            sulphates  0.3712605 0
## 24          citric.acid              density  0.3649471 0
## 25       residual.sugar              density  0.3552834 0
## 36              density                   pH -0.3416989 0
## 39          citric.acid            sulphates  0.3127700 0
## 33            chlorides                   pH -0.2650261 0
## 38     volatile.acidity            sulphates -0.2609867 0

The above results show that some pairs of variables have strong correlations.

  • fixed.acidity, volatile.acidity, citric.acid and acid have relatively high correlations. This makes sense since they are all different types of acids and acid is the sum of the other three. Wines with a high level of fixed.acidity are more likely to have a high level of citric.acid and vice versa. This also applies to other pairs of acids. We also notice that volatile.acidity has a relatively strong correlation with wine quality, while the other three acid variables have weaker correlations.
  • pH has relatively strong correlations with the acid variables. As we know, pH is a numerical scale to specify acidity, so the correlations between them make sense. However, although the acid variables correlate with wine quality to some extent, pH doesn’t show a significant correlation. This raises a question: if A correlates with B, and B correlates with C, does A correlate with C as well? The answer is no: correlation is not transitive, so A does not necessarily correlate with C, which explains our results.
  • free.sulfur.dioxide, bound.sulfur.dioxide and total.sulfur.dioxide have high correlations. This is because the former two variables are components of the latter one. bound.sulfur.dioxide and total.sulfur.dioxide have similar correlation levels with wine quality. free.sulfur.dioxide has a much weaker correlation with quality.
  • density has relatively strong correlations with the acid variables except volatile.acidity. It is interesting to observe that the correlations between density and the other acid variables are all above 0.3, but density and volatile.acidity have an insignificant correlation.
  • density correlates with alcohol, pH and residual.sugar to some extent. alcohol is an important variable in influencing wine quality.
  • chlorides and sulphates have a relatively high correlation (0.37). On the base 10 logarithmic scale, sulphates has a moderate correlation with wine quality (0.31), while chlorides has a weaker one (-0.18).

fixed.acidity, citric.acid, volatile.acidity, acid

From the above diagrams, we see that the acid variables have relatively high correlations. They are all different types of acids in wine. It is also interesting to note that fixed.acidity and citric.acid are positively correlated, while the other two pairs of acids are negatively correlated.

acid vs. fixed.acidity, citric.acid, volatile.acidity

acid is the sum of the other three acid variables, so it makes sense that they are highly correlated with acid. The correlation between fixed.acidity and acid is very strong (0.996) because fixed.acidity is the main component of acid. We should also note that volatile.acidity is negatively correlated with acid.

pH vs. acid, fixed.acidity, citric.acid, volatile.acidity

pH is a numerical scale that measures acidity, and we can see that pH has a relatively strong correlation with the different acid variables. By the definition of pH, the higher the level of acids, the lower the pH value. However, we notice that volatile.acidity has a positive correlation with pH, which means the higher the volatile.acidity level, the higher the pH. It is possible that the pH level in wine is mainly determined by the other types of acids and that the amount of volatile.acidity is too small to influence the pH level.

density vs. acid, fixed.acidity, citric.acid, volatile.acidity

Similar to pH, density has a relatively strong correlation with the acid variables. All correlations are positive except the correlation with volatile.acidity.

density vs. alcohol, pH, residual.sugar

density also correlates with alcohol, pH and residual.sugar to some extent. The correlations between alcohol and density and between pH and density are negative, while the correlation between residual.sugar and density is positive.

chlorides vs. sulphates

sulphates and chlorides are correlated to some extent (0.37). This is not very high in general, but it is relatively high compared to other pairs of variables in our dataset.

free.sulfur.dioxide, bound.sulfur.dioxide, total.sulfur.dioxide

We can see that the sulfur dioxide variables have relatively strong correlations, which makes sense since they are different forms of sulfur dioxide. All pairs of variables have positive correlations.

Bivariate Analysis

Note: The analysis has been provided with the bivariate plots.

Multivariate Plots Section

Top 3 variables correlated with quality

alcohol, volatile.acidity and quality

As we can see, in general, wines with a higher alcohol level and a lower volatile.acidity level are of better quality. However, alcohol and volatile.acidity are not strongly correlated with each other.

alcohol, sulphates_log10 and quality

Similarly, we can see that a higher alcohol level and a higher sulphates_log10 level are associated with higher wine quality. But alcohol and sulphates_log10 are not strongly correlated with each other.

volatile.acidity, sulphates_log10 and quality

A lower level of volatile.acidity and a higher level of sulphates_log10 are associated with good wine quality. Visually, the two variables are weakly correlated.

Other variables and quality

fixed.acidity, volatile.acidity and quality

The diagram shows that although volatile.acidity is relatively strongly correlated with wine quality, fixed.acidity is not. Different acids play different roles in influencing wine quality.

pH, acid and quality

acid and pH are strongly correlated, which is in accordance with common sense. However, neither of them shows a strong correlation with quality.

density, volatile.acidity and quality

density doesn’t correlate noticeably with quality or volatile.acidity.

density, alcohol and quality

density correlates with alcohol to some extent. Wines of better quality tend to have a higher level of alcohol. But density alone doesn’t influence quality much.

chlorides, sulphates

Although the correlation between sulphates and chlorides is 0.37, it is interesting to see that they don’t show a visually strong relationship in the diagram.

Multivariate Analysis

Note: The analysis has been provided with all the plots above.


Final Plots and Summary

Plot One

Description One

The diagram above shows the influence of different types of acids on wine quality. In general, higher acid levels are associated with higher wine quality. However, throughout our analysis, we found that volatile.acidity is negatively correlated with wine quality, and that its correlation with quality is the strongest among the acid variables.

Plot Two

Description Two

alcohol, volatile.acidity and sulphates_log10 are relatively strongly correlated with wine quality compared to other variables in our dataset. The above diagram shows the relations between these three variables and wine quality. Generally, a higher alcohol level, a lower volatile.acidity level and a higher sulphates_log10 level are associated with better red wines. However, the highest correlation with quality that we found is that of alcohol (0.48), which is still not a very strong correlation.

Plot Three

Description Three

The above diagram shows the relation between alcohol and volatile.acidity, the two variables most strongly correlated with wine quality. alcohol is positively correlated with quality and volatile.acidity is negatively correlated with quality. However, the two variables themselves are not strongly correlated with each other. In addition, we should note that neither correlation with quality is very strong.


Reflection

In summary, from the whole analysis of the dataset, we conclude that the key properties in determining wine quality are alcohol, volatile.acidity and sulphates on the base 10 logarithmic scale. We should note that the correlations between these variables and wine quality are not very strong, and we cannot simply conclude that there are clear linear relations between them. Besides, the quality scores of all wines are based on subjective expert ratings, which may be biased. We cannot easily tell the quality of wines based only on the conclusions from this dataset.

In addition, variables in our dataset can have more types of transformations and we may find more interesting relations between variables by further exploring the dataset. For example, ratios of different variables can be computed to see if they also influence the wine quality.

Throughout the analysis, the hardest part was finding a way to investigate the relations between variables. The idea we used was to compute correlations between all variables to see if there are strong ones. Based on the results, we can further explore the relations between variables. For example, inferential statistics can be used to infer from the dataset which properties determine wine quality. Besides, machine learning algorithms like regression can be used to train a model on the key properties we found in our analysis to predict the quality of wine. However, the pairwise-correlation approach only considers the relations between pairs of variables, and it cannot reveal more complex relations among variables. What if the relations are not linear? What if we transform a variable onto another scale? What if we combine two variables (multiply, divide, etc.)? There are still a lot of possibilities.

The quality of wines is not easy to determine. However, it is a very interesting topic to look into in the future.